Analysis of Cyber-security Threats
Abstract
The purpose of this data analysis was to find the most efficient
model for predicting financial loss due to cyber security attacks. Cyber
security attacks effect a number of different types of industries and
can be harmful not only to the privacy of the company, but can also
contribute to company losses. This analysis will explore LASSO, Ridge
and Elastic Net linear regression to find an optimal model in predicting
financial loss. It will also explore support vector regression to fit
the data to a certain margin of error and minimize complexity of the
model.
Introduction
How much money can be lost in a single cyber security threat? By
modeling and observing trends in cyber security threat data, this
analysis seeks to help reduce the financial impact of cyber security
threats. Limitations of this analysis include under reporting and data
bias. Many organizations will not report security breaches publicly to
avoid damage to their reputation. As well, threats can evolve quickly,
leading to models becoming outdated rapidly.
Overview of Data
The dataset was obtained from Kaggle and contains 10 variables that
are potential significant covariates in predicting potential threats.
The variable country refers to the country where the attack occurred.
The other variables analyze incident resolution time in hours, number of
affected users, financial loss, attack type, target industry, attack
source, security vulnerability type and defense mechanism used. Here is
a full list of variables:
- Country country where the attack occurred
- Year year of the incident
- Attack Type type of cyber-security threat
- Target Industry industry targeted
- Financial Loss estimated financial loss in
millions
- Number of Affected Users number of users impacted
by the attack
- Attack Source origin of the attack
- Security Vulnerability Type type of vulnerability
exploited
- Defense Mechanism Used cyber-security defense
strategy applied
- Incident Resolution Time time taken to fully
resolve event in hours
cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')
cyber$Defense.Mechanism.Used <- ifelse(cyber$Defense.Mechanism.Used == "AI-based Detection", "AI", cyber$Defense.Mechanism.Used)
# plot 1 number of affected users by financial loss by vulnerability type
ggplot(cyber, aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
scale_size(range = c(2, 10)) +
labs(
title = "Plot of Security Vulnerability, Volume Compromised, and Financial Impact",
x = "Volume Compromised",
y = "Financial Impact",
color = "Security Vulernability"
) +
theme_minimal()
# plot 3 Density plot of attacked industries
Year_data <- cyber %>%
filter(Year == 2015)
p1<- ggplot(Year_data, aes(x = Number.of.Affected.Users, fill = Country)) +
geom_density(alpha = 0.6) +
labs(
x = "Number of Affected Users",
y = "Density",
fill="Country",
title = "Density Plot of Attacked Countries in 2015"
) +
theme_minimal()
ggplotly(p1)
Year_data1 <- cyber %>%
filter(Year == 2024)
# plot 4 Density plot of Attacked countries
p2 <- ggplot(Year_data1, aes(x = Number.of.Affected.Users, fill = Country)) +
geom_density(alpha = 0.6) +
labs(
x = "Number of Affected Users",
y = "Density",
fill="Country",
title = "Density Plot of Attacked Countries in 2024"
) +
theme_minimal()
ggplotly(p2)
Each country is given a specific color and the number of affected
users is displayed on the x-axis. The density is displayed on the
y-axis. Density in this plot shows the relative frequency of number of
affected users per attack in each country.In 2015, the cyber attacks
within the USA more frequently affected about 310,000 users, compared to
Russia where attacks more frequently attacked 670,000 users and Germany
where most attacks more frequently affected 790,000 users. In 2024, the
number of affected users in the USA does not seem to be much more dense
in any one number and is more dispersed. Japan had more frequent attacks
that affected 500,000 and Brazil had it’s most frequent at 750,000
users.
ggplot(cyber, aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
scale_size(range = c(2, 10)) +
labs(
title = "Plot of Security Vulnerability, Year of Attack, and Resolution Time",
x = "Year of Attack",
y = "Incident Resolution Time in Hours",
color = "Security Vulernability"
) +
theme_minimal()
#plot of zero-day attack type by target industry
zero_data <- cyber %>%
filter(Security.Vulnerability.Type == "Zero-day")
a <- ggplot(zero_data, aes(x = zero_data$Attack.Type, fill =zero_data$Target.Industry )) +
geom_bar(position = "dodge") +
labs(
x = "Threat Type",
y = "Proportion",
fill = "Target Industry",
title = "Zero-day Incidences by Threat Type vs. Target Industry"
) +
theme_minimal()
ggplotly(a)
Zero-day attacks seem to commonly use Phising within both retail and
IT industries, based on the height and count of the bar in these two
categories. Ransomware is the most minimally used zero-day attack on the
government, with a count of just 12.
col0=c("#4682B4", "#B4464B","#9900FF","#FF66FF", "#009933")
boxplot(Incident.Resolution.Time..in.Hours. ~ Defense.Mechanism.Used,
data=cyber,
xlab="Defense Mechanism",
ylab="Resolution Time (hours)",
col = col0,
main="Incident Resolution Time by Defense Mechanism",
cex.main = 0.9,
col.main = "blue")
The boxplot shows that the average resolution time among attacks
based on defense mechanisms is relatively the same. The only exception
seems to be firewall, which appears to take slightly less resolution
time on average.We can indicate the average by looking the black
horizontal line within each box. They all appear to line up with one
another indicating similar averages.
Skewness
cyber$Country[cyber$Country=="USA"]<-1
cyber$Country<- ifelse(cyber$Country!="USA",0)
cyber$Security.Vulnerability.Type[cyber$Security.Vulnerability.Type=="Zero-day"]<-1
cyber$Security.Vulnerability.Type<-ifelse(cyber$Security.Vulnerability.Type!="Zero-day",0)
cyber$Number.of.Affected.Users<-as.numeric(cyber$Number.of.Affected.Users)
cyber$Year<-as.numeric(as.factor(cyber$Year))
cyber$Incident.Resolution.Time..in.Hours.<-as.numeric(cyber$Incident.Resolution.Time..in.Hours.)
c<-skewness(cyber$Financial.Loss..in.Million...)
d<-skewness(cyber$Number.of.Affected.Users)
e<-skewness(cyber$Year)
Skewness = cbind(Financial.Loss = c,
Affected.Users = d,
Year = e)
pander(Skewness)
| -0.01684 |
-0.02537 |
-0.02749 |
The data is normally skewed and no transformation is necessary.
Regularized Regression Modeling
For this analysis, we will explore three types of regularized linear
regression modeling to find key predictors for financial loss from cyber
security attacks.
# Predictors (x) and Response Variable (y)
set.seed(12345)
X <- as.data.frame(cyber[, -5])
y <- cyber$Financial.Loss..in.Million...
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- makeX(X[train_index, ])
X_test <- makeX(X[-train_index, ])
y_train <- y[train_index]
y_test <- y[-train_index]
#standardizing the data
preprocess_params <- preProcess(X_train, method = c("center","scale"))
X_train <- predict(preprocess_params, X_train)
X_test <- predict(preprocess_params, X_test)
#fitting the model
fit_lasso <- glmnet(X_train, y_train, alpha = 1) # Lasso
fit_ridge <- glmnet(X_train, y_train, alpha = 0) # Ridge
fit_elastic_net <- glmnet(X_train, y_train, alpha = 0.5) # Elastic Net
#cross-validation
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1) # Lasso CV
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0) # Ridge CV
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5) # Elastic Net CV
#plot coefficient path
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_lasso, xvar = "lambda", label = TRUE,
lwd = 1.5,
main = "Coefficient Path Analysis: LASSO",
cex.main = 0.9,
col = rainbow(10))
abline(v = log(1), col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)
pander(data.frame(colnames(X_train)[c(27, 5, 20, 12)]),caption= "Feature Variables at log(Lambda) = 0")
Feature Variables at log(Lambda) = 0
| Incident.Resolution.Time..in.Hours. |
| Attack.TypeMan-in-the-Middle |
| Attack.SourceUnknown |
| Target.IndustryHealthcare |
pander(data.frame(colnames(X_train)[c(18, 12, 10, 24, 5, 23, 17, 11, 3)]),caption= " Feature Variables at log(Lambda) = -1")
Feature Variables at log(Lambda) = -1
| Attack.SourceInsider |
| Target.IndustryHealthcare |
| Target.IndustryEducation |
| Defense.Mechanism.UsedEncryption |
| Attack.TypeMan-in-the-Middle |
| Defense.Mechanism.UsedAntivirus |
| Attack.SourceHacker Group |
| Target.IndustryGovernment |
| Attack.TypeDDoS |
The table above displays the feature variables observed in the
Coefficient Analysis plot.
par(mar=c(5,4,6,3))
##
plot(cv_lasso, main = "RMSE Plot: LASSO",
cex.main = 0.9)
# Extract coefficients for the best lambda
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
##
# Lasso Regression (L1 Regularization):
# CAUTION: model formula differs from the regular regression formula
lasso_model.opt <- glmnet(X_train,
y_train,
alpha = 1, # lasso regression
lambda = best.lasso.lambda,standardize = TRUE) # useser selected alpha, optimal lambda
# can be obtained through CV (see below)
lasso_predictions.opt <- predict(lasso_model.opt,
s = best.lasso.lambda, # user selected lambda value
# (regularization paremeter)
newx = X_test) # test data set
# The following RMSE of prediction serves as a validation - one step validation
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda,standardize=TRUE)
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))
# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = 0.1,standardize = TRUE)
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = 0.1, newx = X_test)
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))
RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt,
Ridge.opt = ridge_rmse.opt,
Elasticnet.opt = elastic_net_rmse.opt)
pander(RMSE.opt)
The optimal lambda for each type of linear regression is shown
above.
LASSO Regression Equation
Now, a final model will be extracted using LASSO, Ridge, and Elastic
Net.
#Lasso
# Extract coefficients with names from glmnet
coef_lasso <- coef(cv_lasso, s = best.lasso.lambda)
coef_df <- as.data.frame(as.matrix(coef_lasso))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"
# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)
# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")
# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")
Model equation: y = 50.5369 + 0.6683*Attack.TypeDDoS + -0.4171*Target.IndustryEducation + 0.6626*Target.IndustryGovernment + -0.3464*Target.IndustryHealthcare + 0.265*Attack.SourceHacker Group + -0.5335*Attack.SourceInsider + 0.0065*Defense.Mechanism.UsedAntivirus
Ridge Regression Equation
# Extract coefficients with names from glmnet
coef_ridge <- coef(cv_ridge, s = best.ridge.lambda)
coef_df <- as.data.frame(as.matrix(coef_ridge))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"
# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)
# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")
# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")
Model equation: y = 50.5369 + 0.0114*Year + 0.0521*Attack.TypeDDoS + -0.0125*Attack.TypeMalware + 0.0145*Attack.TypeMan-in-the-Middle + -0.0171*Attack.TypePhishing + -0.0201*Attack.TypeRansomware + -0.0186*Attack.TypeSQL Injection + 7e-04*Target.IndustryBanking + -0.0445*Target.IndustryEducation + 0.0576*Target.IndustryGovernment + -0.0398*Target.IndustryHealthcare + 0.0036*Target.IndustryIT + 0.0055*Target.IndustryRetail + 0.0187*Target.IndustryTelecommunications + 6e-04*Number.of.Affected.Users + 0.0427*Attack.SourceHacker Group + -0.0496*Attack.SourceInsider + 0.0182*Attack.SourceNation-state + -0.0106*Attack.SourceUnknown + -0.0045*Defense.Mechanism.UsedAI + 0.0233*Defense.Mechanism.UsedAntivirus + -0.0193*Defense.Mechanism.UsedEncryption + 0.0027*Defense.Mechanism.UsedFirewall + -0.0029*Defense.Mechanism.UsedVPN + -0.0122*Incident.Resolution.Time..in.Hours.
Elastic Net Regression Equation
# Extract coefficients with names from glmnet
coef_elastic <- coef(cv_elastic_net, s = best.elastic.net.lambda)
coef_df <- as.data.frame(as.matrix(coef_elastic))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"
# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)
# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")
# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")
Model equation: y = 50.5369 + 0.4941*Attack.TypeDDoS + -0.2438*Target.IndustryEducation + 0.5433*Target.IndustryGovernment + -0.168*Target.IndustryHealthcare + 0.1465*Attack.SourceHacker Group + -0.3984*Attack.SourceInsider
lasso.error <- min(cv_lasso$cvm)
ridge.error <- min(cv_ridge$cvm)
enet.error <- min(cv_elastic_net$cvm)
model.errors <- data.frame(
Model = c("LASSO", "Ridge", "Elastic Net"),
CV_Error = c(lasso.error, ridge.error, enet.error)
)
pander(model.errors)
| LASSO |
829.7 |
| Ridge |
829.9 |
| Elastic Net |
830.3 |
Conclusion for Regularized Regression Modeling
The table above shows the MSE values or cross-validated errors for
our three models. A smaller value indicates a model with greater
performance. The LASSO linear regression model out performs both the
Ridge and Elastic Net model. When predicting financial loss caused by
cyber security threats, the LASSO liner regression model will be the
most efficient.
$$
= 50.5369 + 4.818 - 0.4171 + 2.324 - 0.3464 + 0.265 - 0.5335 +
0.0065
$$
Support Vector Regression
For this section of the analysis, linear and radial basis functions
will be fit using support vector regression and an ordinary least
squares regression model will be fit using step-wise variable
selection.
cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')
#One-Hot Encoding
dummy <- dummyVars("~.", data=cyber)
cyber2<-data.frame(predict(dummy, newdata=cyber))
#####
# Split data into features (X) and target (y)
X <- cyber2[, -25] # All columns except the target variable
y <- cyber2[, 25] # Target variable (Financial Loss in Millions)
# create dummy variables
#####
# Split data into training and testing sets
set.seed(123)
train.index <- sample(1:nrow(cyber2), 0.8 * nrow(cyber2))
X.train <- X[train.index, ]
y.train <- y[train.index]
X.test <- X[-train.index, ]
y.test <- y[-train.index]
#####
cyber2.train <- as.data.frame(cyber2[train.index, ])
cyber2.test <- cyber2[-train.index, ]
#####
## Set up custom cross-validation control
my_tune_control <- tune.control(
cross = 5, # Use 5-fold cross-validation, the default is 10-fold cross-validation
nrepeat = 1 # Number of repetitions (for repeated cross-validation)
)
#####
# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.RBF <- tune(svm,
train.x = X.train,
train.y = y.train,
ranges = list(epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100),
gamma = c(0.01, 0.1, 1)), # Hyperpar in RBF
tunecontrol = my_tune_control # 5-fold cross-validation
)
####
# Display the best parameters
#print(tune.RBF$best.parameters)
#####
# Train the final model using the best parameters
final.RBF<- svm(X.train, y.train,
type = "eps-regression", # Use "nu-regression" for nu-SVR
kernel = "radial",
epsilon = tune.RBF$best.parameters$epsilon,
cost = tune.RBF$best.parameters$cost,
gamma = tune.RBF$best.parameters$gamma)
#####
# Make predictions on the test set
pred.RBF <- predict(final.RBF, X.test)
# Evaluate performance
mse.RBF <- mean((y.test - pred.RBF)^2) # mean square error
mae.RBF <- mean(abs(y.test - pred.RBF)) # mean absolute error
#### linear kernel
# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.lin <- tune(svm, train.x = X.train, train.y = y.train,
ranges = list(epsilon = seq(0.1, 0.5, 0.1),
cost = c(1, 10, 100)),
tunecontrol = my_tune_control # 5-fold cross-validation
)
####
# Display the best parameters
#print(tune.lin$best.parameters)
#####
# Train the final model using the best parameters
final.lin<- svm(X.train, y.train,
type = "eps-regression", # Use "nu-regression" for nu-SVR
kernel = "linear",
epsilon = tune.lin$best.parameters$epsilon,
cost = tune.lin$best.parameters$cost)
#####
# Make predictions on the test set
pred.lin <- predict(final.lin, X.test)
# Evaluate performance
mse.lin <- mean((y.test - pred.lin)^2) # mean square error
mae.lin <- mean(abs(y.test - pred.lin))
RBF <- tune.RBF$best.parameters
Linear <- tune.lin$best.parameters
pander(RBF, caption='RBF Best Parameters')
RBF Best Parameters
| 35 |
0.5 |
1 |
1 |
pander(Linear, caption='Linear SVR Best Parameters')
Linear SVR Best Parameters
| 5 |
0.5 |
1 |
The tables above show the fine tuned best parameters for both support
vector regression models.
## ordinary LSE regression model with stepwise variable selection
lse.fit <- lm(Financial.Loss..in.Million...~.,data=cyber2.train)
AIC.fit <- stepAIC(lse.fit,direction="both", trace = FALSE)
pred.lse <- predict(AIC.fit, X.test)
mse.lse <- mean((y.test - pred.lse)^2) # mean square error
mae.lse <- mean(abs(y.test - pred.lse)) # mean absolute error
###
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(AIC.fit)
The model above satisfies the linear regression assumptions and will
be used to help compare performance.
Comparing Regression Models
In this section we will look at the performance of our least squares
regression model after stepwise variable selection with our linear and
radial basis function models from support vector regression.We will
compare the model using MSE and MAE values.
###
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
Linear.SVR = c(mse.lin, mae.lin),
LSE.Reg =c(mse.lse, mae.lse))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
| MSE |
811.8 |
829.2 |
810.4 |
| MAE |
24.54 |
24.75 |
24.45 |
The linear and least squares model performed better than the radial
basis function model. The two models performed similarly, but the least
squares model performed slightly better and will be chosen as the best
model for the data.
Conclusion for Support Vector Regression
Based on our MSE and MAE scores, our least squares estimation has the
best model based on performance in comparison to support vector
regression on linear and radial basis function. It has lower MSE and MAE
scores indicating a better fit model. Our final best model is:
\[
\text{Financial Loss in Millions} = 51.27 + 4.818\times
\text{CountryGermany} - 3.525\times \text{CountryIndia} - 2.324\times
\text{AttackSourceInsider}
\]
---
title: "Regularized Regression Modeling on Financial Loss due to Cyber Threats"
author: "Gabriella Khalil"
date: "2025-03-26"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: no
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: no
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 18px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
  font-weight: bold;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.
if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}
if (!require("GGally")) {
   install.packages("GGally")
library(GGally)
}

library(ggplot2)
library(plotly)
library(cowplot)
library(pander)
library(moments)
library(caret)
library(glmnet)
# General
library(e1071)        # For SVM       # For Ridge, Lasso, logistic regression
library(pROC)         # For ROC curves
library(Metrics)      # For MSE/MAE
library(MASS)
# Optional visualization
library(gridExtra)
knitr::opts_chunk$set(echo = TRUE,   # include code chunk in the output file
                      warning = FALSE,# sometimes, you code may produce warning messages,
                                      # you can choose to include the warning messages in
                                      # the output file. 
                      results = TRUE, # you can also decide whether to include the output
                                      # in the output file.
                      message = FALSE,
                      comment = NA
                      )  

```

# Analysis of Cyber-security Threats

## Abstract

The purpose of this data analysis was to find the most efficient model for predicting financial loss due to cyber security attacks. Cyber security attacks effect a number of different types of industries and can be harmful not only to the privacy of the company, but can also contribute to company losses. This analysis will explore LASSO, Ridge and Elastic Net linear regression to find an optimal model in predicting financial loss. It will also explore support vector regression to fit the data to a certain margin of error and minimize complexity of the model.

## Introduction

How much money can be lost in a single cyber security threat? By modeling and observing trends in cyber security threat data, this analysis seeks to help reduce the financial impact of cyber security threats. Limitations of this analysis include under reporting and data bias. Many organizations will not report security breaches publicly to avoid damage to their reputation. As well, threats can evolve quickly, leading to models becoming outdated rapidly.

## Overview of Data

The dataset was obtained from Kaggle and contains 10 variables that are potential significant covariates in predicting potential threats. The variable country refers to the country where the attack occurred. The other variables analyze incident resolution time in hours, number of affected users, financial loss, attack type, target industry, attack source, security vulnerability type and defense mechanism used. Here is a full list of variables:

-   **Country** country where the attack occurred
-   **Year** year of the incident
-   **Attack Type** type of cyber-security threat
-   **Target Industry** industry targeted
-   **Financial Loss** estimated financial loss in millions
-   **Number of Affected Users** number of users impacted by the attack
-   **Attack Source** origin of the attack
-   **Security Vulnerability Type** type of vulnerability exploited
-   **Defense Mechanism Used** cyber-security defense strategy applied
-   **Incident Resolution Time** time taken to fully resolve event in hours

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot shows the relationship between security vulnerability type and the interaction between the number of affected users and financial loss in millions. In most cases the number of affected users seems to be positively correlated with financial loss. The only exception is seen in zero-day. zero-day refers to a type of vulnerability where the target of the attack is unaware of the attack. Meaning there are zero days for the problem to be solved before it has been exploited. It would make sense that in this type of vulnerability, the fewer users affected would lead to a greater financial loss.'}
cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')
cyber$Defense.Mechanism.Used <- ifelse(cyber$Defense.Mechanism.Used == "AI-based Detection", "AI", cyber$Defense.Mechanism.Used)

# plot 1 number of affected users by financial loss by vulnerability type


ggplot(cyber, aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Number.of.Affected.Users, y = cyber$Financial.Loss..in.Million..., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
  scale_size(range = c(2, 10)) +
  labs(
    title = "Plot of Security Vulnerability, Volume Compromised, and Financial Impact",
    x = "Volume Compromised",
    y = "Financial Impact",
    color = "Security Vulernability"
  ) +
  theme_minimal()
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot shows the density of number of affected users in cybersecurity attacks in each country in the year 2015. '}
# plot 3 Density plot of attacked industries
Year_data <- cyber %>%
  filter(Year == 2015)
p1<- ggplot(Year_data, aes(x = Number.of.Affected.Users, fill = Country)) +
  geom_density(alpha = 0.6) +
  labs(
    x = "Number of Affected Users",
    y = "Density",
    fill="Country",
    title = "Density Plot of Attacked Countries in 2015"
  ) +
  theme_minimal()
ggplotly(p1)
```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap="This plot shows the density of number of affected users in cybersecurity attacks in each country in the year 2024. "}
 Year_data1 <- cyber %>%
  filter(Year == 2024)
# plot 4 Density plot of Attacked countries
p2 <- ggplot(Year_data1, aes(x = Number.of.Affected.Users, fill = Country)) +
  geom_density(alpha = 0.6) +
  labs(
    x = "Number of Affected Users",
    y = "Density",
    fill="Country",
    title = "Density Plot of Attacked Countries in 2024"
  ) +
  theme_minimal()
ggplotly(p2)

```

Each country is given a specific color and the number of affected users is displayed on the x-axis. The density is displayed on the y-axis. Density in this plot shows the relative frequency of number of affected users per attack in each country.In 2015, the cyber attacks within the USA more frequently affected about 310,000 users, compared to Russia where attacks more frequently attacked 670,000 users and Germany where most attacks more frequently affected 790,000 users. In 2024, the number of affected users in the USA does not seem to be much more dense in any one number and is more dispersed. Japan had more frequent attacks that affected 500,000 and Brazil had it's most frequent at 750,000 users.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot shows the relationship between security vulnerability type and the interaction between the year of attack and resolution time. The plot shows how over the years resolution time has improved for certain vulnerability types, such as weak passwords, unpatched software, and social engineering. However, for zero-day there is an increase in resolution time, showing a positive relationship between zero-day and resolution time. '}

ggplot(cyber, aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type)) +
geom_smooth(aes(x = cyber$Year, y = cyber$Incident.Resolution.Time..in.Hours., color = cyber$Security.Vulnerability.Type), method = "lm", se = FALSE) +
  scale_size(range = c(2, 10)) +
  labs(
    title = "Plot of Security Vulnerability, Year of Attack, and Resolution Time",
    x = "Year of Attack",
    y = "Incident Resolution Time in Hours",
    color = "Security Vulernability"
  ) +
  theme_minimal()

```

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='This plot focuses on zero-day vulnerability and shows the proportion of threat types in each industry. The x-axis shows each threat type and the y-axis shows the proportion of attacks within the dataset that fall into each category. The different colors are meant to show a subcategory, target industry. This shows the proportion of the dataset that falls within each category of threat type in each target industry.'}
#plot of zero-day attack type by target industry
zero_data <- cyber %>%
  filter(Security.Vulnerability.Type == "Zero-day") 
a <- ggplot(zero_data, aes(x = zero_data$Attack.Type, fill =zero_data$Target.Industry )) +
  geom_bar(position = "dodge") +
  labs(
    x = "Threat Type",
    y = "Proportion",
    fill = "Target Industry",
    title = "Zero-day Incidences by Threat Type vs. Target Industry"
  ) +
  theme_minimal()
ggplotly(a)
```

Zero-day attacks seem to commonly use Phising within both retail and IT industries, based on the height and count of the bar in these two categories. Ransomware is the most minimally used zero-day attack on the government, with a count of just 12.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The bloxplot shows the relationship between defense mechanism and resolution time in hours.'}

col0=c("#4682B4", "#B4464B","#9900FF","#FF66FF", "#009933")
boxplot(Incident.Resolution.Time..in.Hours. ~ Defense.Mechanism.Used,
        data=cyber,
        xlab="Defense Mechanism",
        ylab="Resolution Time (hours)",
        col = col0,
        main="Incident Resolution Time by Defense Mechanism",
        cex.main = 0.9,
        col.main = "blue")


```

The boxplot shows that the average resolution time among attacks based on defense mechanisms is relatively the same. The only exception seems to be firewall, which appears to take slightly less resolution time on average.We can indicate the average by looking the black horizontal line within each box. They all appear to line up with one another indicating similar averages.

## Skewness

```{r}

cyber$Country[cyber$Country=="USA"]<-1
cyber$Country<- ifelse(cyber$Country!="USA",0)
cyber$Security.Vulnerability.Type[cyber$Security.Vulnerability.Type=="Zero-day"]<-1
cyber$Security.Vulnerability.Type<-ifelse(cyber$Security.Vulnerability.Type!="Zero-day",0)
cyber$Number.of.Affected.Users<-as.numeric(cyber$Number.of.Affected.Users)
cyber$Year<-as.numeric(as.factor(cyber$Year))
cyber$Incident.Resolution.Time..in.Hours.<-as.numeric(cyber$Incident.Resolution.Time..in.Hours.)


c<-skewness(cyber$Financial.Loss..in.Million...)
d<-skewness(cyber$Number.of.Affected.Users)
e<-skewness(cyber$Year) 

Skewness = cbind(Financial.Loss = c, 
                 Affected.Users =  d, 
                 Year = e)

pander(Skewness)
```

The data is normally skewed and no transformation is necessary.

# Regularized Regression Modeling

For this analysis, we will explore three types of regularized linear regression modeling to find key predictors for financial loss from cyber security attacks.

```{r fig.align='center', fig.width=10, fig.height=12, fig.cap='The x-axis of the coefficient path analysis shows the log of lambda. A low lambda has less regularization and is closer to ordinary least squares. A high lambda indicates higher regularization and at this lambda many coefficients will shrink to zero. The Y-axis shows the coefficient values for each variable predictor for financial loss. The lines shows the path of each coefficient has lambda differs. The dashed line shows the optimal level of shrinkage. Coefficients to the right of this line are more forced towards zero. At this chosen lambda only certain variables will be non-zero and are deemed significant. '}




# Predictors (x) and Response Variable (y)

set.seed(12345)
X <- as.data.frame(cyber[, -5])  

y <- cyber$Financial.Loss..in.Million...


train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- makeX(X[train_index, ])
X_test <- makeX(X[-train_index, ])
y_train <- y[train_index]
y_test <- y[-train_index]

#standardizing the data
preprocess_params <- preProcess(X_train, method = c("center","scale"))
X_train <- predict(preprocess_params, X_train)
X_test <- predict(preprocess_params, X_test)


#fitting the model
fit_lasso <- glmnet(X_train, y_train, alpha = 1)  # Lasso
fit_ridge <- glmnet(X_train, y_train, alpha = 0)  # Ridge
fit_elastic_net <- glmnet(X_train, y_train, alpha = 0.5)  # Elastic Net
#cross-validation
cv_lasso <- cv.glmnet(X_train, y_train, alpha = 1)       # Lasso CV
cv_ridge <- cv.glmnet(X_train, y_train, alpha = 0)       # Ridge CV
cv_elastic_net <- cv.glmnet(X_train, y_train, alpha = 0.5)  # Elastic Net CV

#plot coefficient path
par(mar=c(5,4,6,3))
# Plot coefficient path
plot(fit_lasso, xvar = "lambda", label = TRUE,
     lwd = 1.5,
     main = "Coefficient Path Analysis: LASSO",
     cex.main = 0.9,
     col = rainbow(10))
abline(v = log(1), col = "purple", lty = 4, lwd = 2)
abline(v = -1, col = "steelblue", lty = 2, lwd = 2)

```

```{r}

pander(data.frame(colnames(X_train)[c(27, 5, 20, 12)]),caption= "Feature Variables at log(Lambda) = 0")

pander(data.frame(colnames(X_train)[c(18, 12, 10, 24, 5, 23, 17, 11, 3)]),caption= " Feature Variables at log(Lambda) = -1")
```

The table above displays the feature variables observed in the Coefficient Analysis plot.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plot of performance shows that as lambda increases, the MSE decreases. The two vertical lines give a reference of lambda. '}
par(mar=c(5,4,6,3))
##
plot(cv_lasso, main = "RMSE Plot: LASSO",
     cex.main = 0.9)
```

```{r}
# Extract coefficients for the best lambda
best.lasso.lambda <- cv_lasso$lambda.min
best.ridge.lambda <- cv_ridge$lambda.min
best.elastic.net.lambda <- cv_elastic_net$lambda.min
##
# Lasso Regression (L1 Regularization): 
# CAUTION: model formula differs from the regular regression formula 
lasso_model.opt <- glmnet(X_train, 
                      y_train, 
                      alpha = 1,      # lasso regression 
                      lambda = best.lasso.lambda,standardize = TRUE)   # useser selected alpha, optimal lambda
                                    # can be obtained through CV (see below)
lasso_predictions.opt <- predict(lasso_model.opt, 
                             s = best.lasso.lambda, # user selected lambda value 
                                      # (regularization paremeter)
                             newx = X_test)  # test data set 
# The following RMSE of prediction serves as a validation - one step validation
lasso_rmse.opt <- sqrt(mean((y_test - lasso_predictions.opt)^2))   
                                          
# Ridge Regression (L2 Regularization)
ridge_model.opt <- glmnet(X_train, y_train, alpha = 0, lambda = best.ridge.lambda,standardize=TRUE)
ridge_predictions.opt <- predict(ridge_model.opt, s = best.ridge.lambda, newx = X_test)
ridge_rmse.opt <- sqrt(mean((y_test - ridge_predictions.opt)^2))

# Elastic Net (Combination of L1 and L2)
elastic_net_model.opt <- glmnet(X_train, y_train, alpha = 0.5, lambda = 0.1,standardize = TRUE)
elastic_net_predictions.opt <- predict(elastic_net_model.opt, s = 0.1, newx = X_test)
elastic_net_rmse.opt <- sqrt(mean((y_test - elastic_net_predictions.opt)^2))

RMSE.opt = cbind(LASSO.opt = lasso_rmse.opt, 
                 Ridge.opt =  ridge_rmse.opt, 
                 Elasticnet.opt = elastic_net_rmse.opt)

pander(RMSE.opt)
```

The optimal lambda for each type of linear regression is shown above.

## LASSO Regression Equation

Now, a final model will be extracted using LASSO, Ridge, and Elastic Net.

```{r}
#Lasso
# Extract coefficients with names from glmnet
coef_lasso <- coef(cv_lasso, s = best.lasso.lambda)
coef_df <- as.data.frame(as.matrix(coef_lasso))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")


```

## Ridge Regression Equation

```{r}

# Extract coefficients with names from glmnet
coef_ridge <- coef(cv_ridge, s = best.ridge.lambda)
coef_df <- as.data.frame(as.matrix(coef_ridge))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")



```

## Elastic Net Regression Equation

```{r}

# Extract coefficients with names from glmnet
coef_elastic <- coef(cv_elastic_net, s = best.elastic.net.lambda)
coef_df <- as.data.frame(as.matrix(coef_elastic))
coef_df$feature <- rownames(coef_df)
colnames(coef_df)[1] <- "coefficient"

# Filter non-zero coefficients
nonzero_coefs <- subset(coef_df, coefficient != 0)

# Separate intercept
intercept <- round(nonzero_coefs$coefficient[nonzero_coefs$feature == "(Intercept)"], 4)
nonzero_terms <- subset(nonzero_coefs, feature != "(Intercept)")

# Build model equation string
equation <- paste0(round(nonzero_terms$coefficient, 4), "*", nonzero_terms$feature)
cat("Model equation: y =", intercept, "+", paste(equation, collapse = " + "), "\n")



```

```{r}
lasso.error <- min(cv_lasso$cvm)
ridge.error <- min(cv_ridge$cvm)
enet.error  <- min(cv_elastic_net$cvm)

model.errors <- data.frame(
  Model = c("LASSO", "Ridge", "Elastic Net"),
  CV_Error = c(lasso.error, ridge.error, enet.error)
)

pander(model.errors)

```

## Conclusion for Regularized Regression Modeling

The table above shows the MSE values or cross-validated errors for our three models. A smaller value indicates a model with greater performance. The LASSO linear regression model out performs both the Ridge and Elastic Net model. When predicting financial loss caused by cyber security threats, the LASSO liner regression model will be the most efficient.

$$

\text{Financial Loss in Millions} = 50.5369 + 4.818\times \text{Attack Type DDoS} - 0.4171\times \text{Target Industry Education} + 2.324\times \text{Target Industry Government} - 0.3464\times \text{Target Industry Healthcare} + 0.265\times \text {Attack Source Hacker Group} - 0.5335\times \text{Attack Source Insider} + 0.0065\times \text {Defense Mechanism Used Antivirus}

$$

# Support Vector Regression

For this section of the analysis, linear and radial basis functions will be fit using support vector regression and an ordinary least squares regression model will be fit using step-wise variable selection.

```{r}

cyber <- read.csv('https://raw.githubusercontent.com/GabbyK8/Datasets/refs/heads/main/Global_Cybersecurity_Threats_2015-2024.csv')

#One-Hot Encoding
dummy <- dummyVars("~.", data=cyber)
cyber2<-data.frame(predict(dummy, newdata=cyber))

#####
# Split data into features (X) and target (y)
X <- cyber2[, -25]  # All columns except the target variable
y <- cyber2[, 25]   # Target variable (Financial Loss in Millions)

# create dummy variables



#####
# Split data into training and testing sets
set.seed(123)
train.index <- sample(1:nrow(cyber2), 0.8 * nrow(cyber2))
X.train <- X[train.index, ]
y.train <- y[train.index]
X.test <- X[-train.index, ]
y.test <- y[-train.index]

#####
cyber2.train <- as.data.frame(cyber2[train.index, ])
cyber2.test <- cyber2[-train.index, ]
#####
## Set up custom cross-validation control
my_tune_control <- tune.control(
  cross = 5,  # Use 5-fold cross-validation, the default is 10-fold cross-validation
  nrepeat = 1 # Number of repetitions (for repeated cross-validation)
)
#####
# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.RBF <- tune(svm,
                 train.x = X.train, 
                 train.y = y.train, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100), 
                                  gamma = c(0.01, 0.1, 1)), # Hyperpar in RBF
                    tunecontrol = my_tune_control    # 5-fold cross-validation

                                    )

####
# Display the best parameters
#print(tune.RBF$best.parameters)
#####
# Train the final model using the best parameters
final.RBF<- svm(X.train, y.train, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "radial", 
                   epsilon = tune.RBF$best.parameters$epsilon, 
                   cost = tune.RBF$best.parameters$cost, 
                   gamma = tune.RBF$best.parameters$gamma)
#####
# Make predictions on the test set
pred.RBF <- predict(final.RBF, X.test)

# Evaluate performance
mse.RBF <- mean((y.test - pred.RBF)^2)    # mean square error
mae.RBF <- mean(abs(y.test - pred.RBF))   # mean absolute error

#### linear kernel

# Perform grid search for hyperparameter tuning: RBF kernel is used
tune.lin <- tune(svm, train.x = X.train, train.y = y.train, 
                    ranges = list(epsilon = seq(0.1, 0.5, 0.1), 
                                  cost = c(1, 10, 100)), 
                    tunecontrol = my_tune_control # 5-fold cross-validation
                    )
####
# Display the best parameters
#print(tune.lin$best.parameters)
#####
# Train the final model using the best parameters
final.lin<- svm(X.train, y.train, 
                   type = "eps-regression",  # Use "nu-regression" for nu-SVR
                   kernel = "linear", 
                   epsilon = tune.lin$best.parameters$epsilon, 
                   cost = tune.lin$best.parameters$cost)
#####
# Make predictions on the test set
pred.lin <- predict(final.lin, X.test)

# Evaluate performance
mse.lin <- mean((y.test - pred.lin)^2)    # mean square error
mae.lin <- mean(abs(y.test - pred.lin))


RBF <- tune.RBF$best.parameters
Linear <- tune.lin$best.parameters
pander(RBF, caption='RBF Best Parameters')
pander(Linear, caption='Linear SVR Best Parameters')

```

The tables above show the fine tuned best parameters for both support vector regression models.

```{r fig.align='center', fig.width=7, fig.height=5, fig.cap='The plots are a diagnostic plot for the least squares regression model.'}

## ordinary LSE regression model with stepwise variable selection
lse.fit <- lm(Financial.Loss..in.Million...~.,data=cyber2.train)
AIC.fit <- stepAIC(lse.fit,direction="both", trace = FALSE)
pred.lse <- predict(AIC.fit, X.test)
mse.lse <- mean((y.test - pred.lse)^2)    # mean square error
mae.lse <- mean(abs(y.test - pred.lse))   # mean absolute error
###
par(mfrow=c(2,2), mar=c(2,2,2,2))
plot(AIC.fit)

```

The model above satisfies the linear regression assumptions and will be used to help compare performance.

## Comparing Regression Models

In this section we will look at the performance of our least squares regression model after stepwise variable selection with our linear and radial basis function models from support vector regression.We will compare the model using MSE and MAE values.

```{r}
###
Performance <- data.frame(RBF.SVR=c(mse.RBF, mae.RBF),
                          Linear.SVR = c(mse.lin, mae.lin),
                          LSE.Reg =c(mse.lse, mae.lse))
row.names(Performance) <- c("MSE", "MAE")
##
pander(Performance)
```

The linear and least squares model performed better than the radial basis function model. The two models performed similarly, but the least squares model performed slightly better and will be chosen as the best model for the data.

## Conclusion for Support Vector Regression

Based on our MSE and MAE scores, our least squares estimation has the best model based on performance in comparison to support vector regression on linear and radial basis function. It has lower MSE and MAE scores indicating a better fit model. Our final best model is:

$$
\text{Financial Loss in Millions} = 51.27 + 4.818\times \text{CountryGermany} - 3.525\times \text{CountryIndia} - 2.324\times \text{AttackSourceInsider}
$$
